# Document Visual Question Answering

## MLCD-ViT-bigG-Patch14-448

**Author:** DeepGlint-AI · **License:** MIT · **Tags:** Text Recognition · **Downloads:** 1,517 · **Likes:** 3

MLCD-ViT-bigG is an advanced Vision Transformer model enhanced with 2D Rotary Position Encoding (RoPE2D), excelling at document understanding and visual question answering tasks.
## Pixtral-12B-quantized.w8a8

**Author:** RedHatAI · **License:** Apache-2.0 · **Tags:** Image-to-Text · Transformers · English · **Downloads:** 309 · **Likes:** 1

An INT8-quantized version of mgoin/pixtral-12b that supports vision-text multimodal tasks with improved inference efficiency.
## Qwen2.5-VL-3B-Instruct-quantized.w8a8

**Author:** RedHatAI · **License:** Apache-2.0 · **Tags:** Image-to-Text · Transformers · English · **Downloads:** 274 · **Likes:** 1

A quantized version of Qwen/Qwen2.5-VL-3B-Instruct that accepts visual-text input and produces text output, with both weights and activations quantized to INT8 (W8A8).
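"W8A8" means both weights and activations are stored as 8-bit integers. The core idea can be sketched with symmetric per-tensor INT8 quantization — a simplified illustration of the general technique, not the exact scheme these checkpoints use:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: x ~= scale * q."""
    scale = float(np.abs(x).max()) / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from INT8 codes."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.27, 0.02, 1.27], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
# Per-element quantization error is bounded by scale / 2.
```

Storing `q` instead of `x` cuts memory 4x versus FP32 (2x versus FP16), and INT8 matrix multiplies can run on faster integer hardware paths, which is where the inference-efficiency gain comes from.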
## Florence2-EntityExtraction

**Author:** jena-shreyas · **License:** MIT · **Tags:** Image-to-Text · Transformers · English · **Downloads:** 23 · **Likes:** 0

Florence-2 DocVQA is a document visual question answering model fine-tuned from Microsoft's Florence-2-large, designed specifically for question-answering over document images.
## UDOP-Large-512-300k

**Author:** microsoft · **License:** MIT · **Tags:** Image-to-Text · Transformers · **Downloads:** 264 · **Likes:** 32

UDOP is a universal document processing model that unifies vision, text, and layout. Built on the T5 architecture, it is suited to document AI tasks.
## UDOP-Large-512

**Author:** microsoft · **License:** MIT · **Tags:** Image-to-Text · Transformers · **Downloads:** 193 · **Likes:** 5

UDOP is a universal document processing model that unifies vision, text, and layout. Built on the T5 architecture, it is suited to tasks such as document image classification, parsing, and visual question answering.
## Testdocumentquestionanswering

**Author:** Dhineshk · **Tags:** Image-to-Text · Transformers · **Downloads:** 16 · **Likes:** 0

A document visual question answering model based on the LayoutLMv2 architecture, fine-tuned for DocVQA tasks.
## LayoutLMv3-Finetuned-DocVQA

**Author:** am-infoweb · **Tags:** Image-to-Text · Transformers · **Downloads:** 22 · **Likes:** 3

A document question answering model fine-tuned from LayoutLMv3-base, suitable for document visual question answering tasks.
## Donut-Base-Finetuned-DocVQA

**Author:** Xenova · **Tags:** Image-to-Text · Transformers · **Downloads:** 114 · **Likes:** 16

A document Q&A model based on the Donut architecture, capable of extracting text information from images and answering questions about them.
## LayoutLMv2-Base-Uncased-Finetuned-DocVQA

**Author:** madiltalay · **Tags:** Image-to-Text · Transformers · **Downloads:** 14 · **Likes:** 0

A document visual question answering model based on the LayoutLMv2 architecture, fine-tuned specifically for document understanding tasks.
## LayoutLMv2-Base-Uncased-Finetuned-DocVQA

**Author:** hugginglaoda · **Tags:** Image-to-Text · Transformers · **Downloads:** 16 · **Likes:** 0

A document visual question answering model based on the LayoutLMv2 architecture, fine-tuned specifically for document understanding tasks.
## Pix2Struct-DocVQA-Base

**Author:** google · **License:** Apache-2.0 · **Tags:** Image-to-Text · Transformers · Multilingual · **Downloads:** 8,601 · **Likes:** 37

Pix2Struct is an image-encoder/text-decoder model trained on image-text pairs, supporting tasks including image captioning and visual question answering.
## Pix2Struct-DocVQA-Large

**Author:** google · **License:** Apache-2.0 · **Tags:** Image-to-Text · Transformers · Multilingual · **Downloads:** 984 · **Likes:** 31

Pix2Struct is a vision-language model with an image-encoder/text-decoder architecture, fine-tuned specifically for document visual question answering tasks.
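Both Pix2Struct DocVQA checkpoints can be queried through the Transformers API. A minimal sketch, assuming `transformers` and `Pillow` are installed and `document.png` is a hypothetical local document image:

```python
MODEL_ID = "google/pix2struct-docvqa-base"  # or "google/pix2struct-docvqa-large"

def ask(image_path: str, question: str) -> str:
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(MODEL_ID)
    model = Pix2StructForConditionalGeneration.from_pretrained(MODEL_ID)

    # For DocVQA checkpoints the question is passed as the text input;
    # the processor renders it as a header above the document image.
    inputs = processor(images=Image.open(image_path), text=question,
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example (downloads the checkpoint weights on first use):
# print(ask("document.png", "What is the invoice total?"))
```

Because Pix2Struct is OCR-free, the answer is generated directly from pixels — no separate text-extraction step is required.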
## LayoutLMv2-Base-Uncased-Finetuned-DocVQA-V2

**Author:** MariaK · **Tags:** Image-to-Text · Transformers · **Downloads:** 54 · **Likes:** 3

A fine-tuned version of microsoft/layoutlmv2-base-uncased for document visual question answering, focused on processing the text and layout information in document images.
## LayoutLM-Invoices

**Author:** faisalraza · **Tags:** Image-to-Text · Transformers · English · **Downloads:** 100 · **Likes:** 7

A document QA model fine-tuned from the LayoutLM architecture, designed specifically for structured documents such as invoices.
## Donut-Base-Finetuned-DocVQA

**Author:** naver-clova-ix · **License:** MIT · **Tags:** Image-to-Text · Transformers · **Downloads:** 167.80k · **Likes:** 231

Donut is an OCR-free document understanding Transformer, fine-tuned on the DocVQA dataset, that extracts and comprehends text information directly from images.
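Given that this is the most downloaded model in the list, a usage sketch may help. It assumes the Transformers `document-question-answering` pipeline (which handles Donut-style checkpoints) and a hypothetical local image `receipt.png`:

```python
def build_donut_prompt(question: str) -> str:
    # Donut checkpoints are steered by task-token prompts; the DocVQA
    # checkpoint wraps the question like this (per its model card).
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"

def answer(image_path: str, question: str):
    # The pipeline builds the task prompt and decodes the answer internally,
    # so no manual prompt construction is needed in practice.
    from transformers import pipeline
    dqa = pipeline("document-question-answering",
                   model="naver-clova-ix/donut-base-finetuned-docvqa")
    return dqa(image=image_path, question=question)

prompt = build_donut_prompt("What is the total amount?")
# Example (downloads the checkpoint weights on first use):
# answer("receipt.png", "What is the total amount?")
```

Because Donut reads pixels end-to-end, it needs no external OCR engine — a key difference from the LayoutLM family, which consumes OCR tokens plus bounding boxes.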
## LayoutLMv2-Large-Uncased-Finetuned-Vi-InfoVQA

**Author:** tiennvcs · **Tags:** Image-to-Text · Transformers · **Downloads:** 16 · **Likes:** 0

A document visual question answering model fine-tuned from microsoft/layoutlmv2-large-uncased, suited to Vietnamese information extraction tasks.
## LayoutLMv2-Large-Uncased-Finetuned-InfoVQA

**Author:** tiennvcs · **Tags:** Question Answering · Transformers · **Downloads:** 16 · **Likes:** 2

A document understanding model based on the LayoutLMv2 architecture, fine-tuned for InfoVQA tasks.